Amanda Birmingham, CCBB, UCSD (abirmingham@ucsd.edu)
Related Notebooks:
The heart of 16S microbiome analysis is the assignment of each sample's sequences to the microbial "species" (technically "operational taxonomic units", or OTUs) from which they originated. This step is known as "OTU picking", and its results are strongly affected by the picking method used. The available picking methods are differentiated primarily by what set of 16S sequences, if any, is used as the reference defining the OTUs to which new sequences can be assigned; they are:
de novo picking, in which no reference set is used and reads are clustered only against one another;
closed-reference picking, in which reads are clustered only against a reference sequence set, and reads that do not match any reference sequence are discarded; and
open-reference picking, in which reads are first clustered against a reference set and any reads that do not match are then clustered de novo.
More details on these approaches can be found at http://qiime.org/tutorials/otu_picking.html?highlight=otu%20picking . Following the QIIME recommendations, we prefer open-reference picking whenever the customer has expressed no explicit preference otherwise (which is likely to be most of the time!).
Also, unless otherwise specified by the customer, the Greengenes set of 16S gene sequences is our preferred reference. Because this set is widely used, its current version is included as part of the QIIME install. The path to the fasta file containing the Greengenes 16S sequences is listed in the QIIME config, and can be seen in the output of the command
print_qiime_config.py
Look in the QIIME config values for the line beginning with assign_taxonomy_reference_seqs_fp, which will look something like
assign_taxonomy_reference_seqs_fp: /usr/local/lib/python2.7/dist-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta
The file-path shown here will be used as input to the OTU picking process.
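If you want to capture this path programmatically rather than copying it by hand, a minimal sketch like the following will do. It assumes only that the config output uses "key: value" lines; the config_text variable is a stand-in for the real print_qiime_config.py output, and get_config_value is a hypothetical helper, not part of QIIME.

```python
# Sketch: pull the Greengenes reference path out of print_qiime_config.py
# output. ASSUMPTION: the output uses "key: value" lines; config_text below
# is a stand-in for the real command's output.
def get_config_value(config_text, key):
    """Return the value for the given config key, or None if absent."""
    for line in config_text.splitlines():
        if line.startswith(key + ":"):
            return line.split(":", 1)[1].strip()
    return None

config_text = ("assign_taxonomy_reference_seqs_fp: "
               "/usr/local/lib/python2.7/dist-packages/qiime_default_reference/"
               "gg_13_8_otus/rep_set/97_otus.fasta")
print(get_config_value(config_text, "assign_taxonomy_reference_seqs_fp"))
```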
Again, in QIIME, OTU picking can be accomplished with a single command-line call--but this one usually takes a while to run, and therefore benefits greatly from being run in parallel. The -a switch tells QIIME to use parallel processing, and the number given with the -O switch tells it how many processors to use. It is best not to use ALL the CPUs on your cluster, but to leave at least one free for other necessary processing; for example, with a 3-node c3.2xlarge cluster, which has 3*8=24 CPUs, you might specify 22 or 23 CPUs.
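As a quick sanity check on that arithmetic, the choice of -O value can be sketched as follows; the node and CPU counts simply mirror the example cluster above, and parallel_jobs is a hypothetical helper for illustration.

```python
# Sketch: pick a value for QIIME's -O switch, leaving at least one CPU
# free for other work on the cluster.
def parallel_jobs(nodes, cpus_per_node, reserve=1):
    """Number of parallel jobs to request, never less than 1."""
    total_cpus = nodes * cpus_per_node
    return max(1, total_cpus - reserve)

# 3-node c3.2xlarge cluster: 3 * 8 = 24 CPUs, so request at most 23 jobs
print(parallel_jobs(3, 8))  # -> 23
```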
The full command has the format
pick_open_reference_otus.py -a -O [number of CPUs] -i [sequences file path] -r [reference file path] -o [output directory path]
and an example looks like
pick_open_reference_otus.py -a -O 23 -i /data/library_split/seqs.fna -r /usr/local/lib/python2.7/dist-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta -o /data/open_ref_output/
I have seen this step take from 20 minutes to 2 hours, depending on data size, on the three-node cluster described above.
The config file on the QIIME AMI used for this cluster routes temporary files to the /data directory, so during the run you may see various files appear there; these have various prefixes and extensions that look like .e[numbers] or .o[numbers] (ex: ALIGN_9P6_0.o49, POTU_VuwL_13.e14, etc.). The "e" files are errors from individual jobs, while the "o" files are logs from individual jobs. QIIME is supposed to clean all of these up on completion, but I have rarely seen it do so. However, aside from clearing them out when your job is done, you don't need to give them any particular attention, as the results are summarized in the log_[datetime].txt file. Note that many of the "errors" at particular steps are dealt with successfully at later steps, so the presence of error files doesn't mean the run failed. To check that it hasn't, it is a good idea to skim through the log to ensure that you don't see any text listed in the "Stderr:" fields for each logged command.
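If skimming a long log by eye is tedious, a rough helper along these lines can flag the "Stderr:" fields that actually contain text. The log layout assumed here (a bare "Stderr:" line followed by any error text) is an assumption based on the description above, so verify it against a real log before relying on it.

```python
# Sketch: flag non-empty "Stderr:" fields in a QIIME log. ASSUMPTION: each
# logged command's stderr appears on the line(s) immediately following a
# bare "Stderr:" marker; check this against a real log_[datetime].txt.
def nonempty_stderr_fields(log_lines):
    """Return any text found immediately after a 'Stderr:' marker."""
    problems = []
    for i, line in enumerate(log_lines):
        if line.strip() == "Stderr:":
            j = i + 1
            while j < len(log_lines) and log_lines[j].strip():
                problems.append(log_lines[j].strip())
                j += 1
    return problems

sample_log = ["Stdout:", "", "Stderr:", "",
              "Stderr:", "Traceback (most recent call last):", ""]
print(nonempty_stderr_fields(sample_log))  # -> ['Traceback (most recent call last):']
```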
The pick_open_reference_otus.py command generates a lot of output; a high-level overview is given on the generated index.html page, and details are available at http://qiime.org/scripts/pick_open_reference_otus.html?highlight=pick_open_reference_otus , but only a few of the outputs are used directly in subsequent steps. These are:
otu_table_mc2_w_tax_no_pynast_failures.biom: The OTU table in biom format; basically, this is a very large, usually very sparse table listing the number of reads for each identified OTU in each sample. This particular version of the table excludes OTUs with fewer than 2 sequences as well as sequences that fail to align with PyNAST, and includes OTU taxonomy assignments. It is thus the "cleanest" version and the one you'll want to use going forward.
rep_set.fna: A fasta file of one representative sequence from each identified OTU (note that there are several different ways to choose the representative sequences, and generally it is ok just to use whatever pick_open_reference_otus.py's default is).
rep_set.tre: A phylogenetic tree of the representative sequences for the identified OTUs, describing their inferred evolutionary relationships.
In addition to the all-important OTU table as well as the representative sequence set and its phylogenetic tree, you need one more piece of information from the OTU-picking output--but this one requires a judgment call from you as the analyst. This needed info is the sequencing depth that will be used in the diversity analyses. As stated in the QIIME tutorial at http://nbviewer.ipython.org/github/biocore/qiime/blob/1.9.1/examples/ipynb/illumina_overview_tutorial.ipynb , "Many of the analyses that follow require that there are an equal number of sequences in each sample, so you need to review the counts/sample detail and decide what depth you'd like. Any samples that don't have at least that many sequences will not be included in the analyses, so this is always a trade-off between the number of sequences you throw away and the number of samples you throw away."
You may wonder why we look at the counts/sample information to make this decision NOW, rather than having done so earlier, when we got that information from split_libraries_fastq.py. The reason is that some reads get filtered out during the OTU picking process--possibly even quite a lot of them, for certain data sets and certain picking methods. Therefore, it is necessary to make the sequencing depth decision based on the revised distribution of counts per sample after OTU picking is complete.
To generate a listing of this distribution, run the summarize-table command from the biom software (included as part of the QIIME install) on the otu_table_mc2_w_tax_no_pynast_failures.biom file:
biom summarize-table -i [biom table]
as in this example:
biom summarize-table -i /data/open_ref_output/otu_table_mc2_w_tax_no_pynast_failures.biom
This will print summary information about counts/sample to stdout; if you want it in a more persistent format that is easy to re-sort and re-examine, redirect the output to a file.
The output looks, in part, like this:
Num samples: 399
Num observations: 33667
Total count: 6125023
Table density (fraction of non-zero values): 0.019
Counts/sample summary:
Min: 0.0
Max: 38638.0
Median: 14466.000
Mean: 15350.935
Std. dev.: 7720.364
Sample Metadata Categories: None provided
Observation Metadata Categories: taxonomy
Counts/sample detail:
925.sm1z: 0.0
925.sl1z: 0.0
925.sn2x: 0.0
925.waterfilter.ij.50: 0.0
925.sm4y: 1.0
925.waterfilter.ik.50: 1.0
925.sn4z: 1.0
925.in2z: 1.0
925.in2y: 1.0
925.sm3z: 1.0
925.sl1x: 1.0
925.sl3x: 1.0
925.so2y: 1.0
925.waterfilter.ie.50: 1.0
925.sl4z: 1.0
925.sk5z: 1.0
925.sn5x: 1.0
925.waterfilter.il.50: 1.0
925.sn3x: 1.0
925.io2z: 1.0
925.sl5y: 1.0
925.ia3y: 1.0
925.waterfilter.id.50: 1.0
925.waterfilter.ic.120: 1.0
925.bisonfeces1: 1.0
925.bisonfeces3: 1.0
925.waterfilter.ib.80: 1.0
925.sm1y: 1.0
925.im2z: 1.0
925.so3x: 1.0
925.waterfilter.ih.50: 2.0
925.sn3z: 2.0
925.sk5y: 3.0
925.so3y: 3.0
925.y3ntc.h12: 330.0
925.sg5x: 1383.0
925.if2z: 1580.0
925.y4ntc.h12: 2159.0
925.ie5x: 4966.0
925.ie5y: 5709.0
925.sg5y: 5888.0
925.if5y: 6256.0
925.sf2y: 6644.0
The most useful part of this output is the Counts/sample detail section, which shows an ascending-sorted list of counts per sample. For example, in the output shown above, we know from the Num samples line that there are 399 samples (although most are not shown in this snippet of output), and we see that 4 of them have zero reads, 26 have one read, 2 have two reads, 2 have three reads, 1 has 330 reads, and all the rest have more than 1000 reads.
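To explore candidate depths without hand-counting, one can parse the Counts/sample detail section and tally how many samples fall below each candidate depth. This sketch assumes only the summary layout shown above; summary_text is a trimmed stand-in for a real summary file, and parse_counts_detail is a hypothetical helper, not part of the biom software.

```python
# Sketch: parse the "Counts/sample detail:" section of biom summarize-table
# output and count samples that would be dropped at candidate depths.
# summary_text below is a trimmed stand-in for a real summary file.
def parse_counts_detail(summary_text):
    """Return {sample_id: read_count} from the detail section."""
    counts, in_detail = {}, False
    for line in summary_text.splitlines():
        if line.strip() == "Counts/sample detail:":
            in_detail = True
        elif in_detail and ":" in line:
            sample, value = line.rsplit(":", 1)
            counts[sample.strip()] = float(value)
    return counts

summary_text = """Counts/sample summary:
Min: 0.0
Counts/sample detail:
925.sm1z: 0.0
925.sg5x: 1383.0
925.if2z: 1580.0"""

counts = parse_counts_detail(summary_text)
for depth in (1000, 1383):
    dropped = sum(1 for c in counts.values() if c < depth)
    print("depth %d drops %d sample(s)" % (depth, dropped))
```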
Using this information, you need to decide on the minimum sequencing depth that all samples included in further analyses must have (and thus, which samples to leave out). As noted in the quoted text above, samples that have MORE than the minimum sequencing depth will be included, but only a subset of their reads (up to the minimum sequencing depth) will be used in the further analyses, so you are also deciding how many reads to leave out. If the customer has specified a preferred minimum sequencing depth, that makes the choice easy! However, most customers savvy enough to have done this are also savvy enough to do their own analyses, so you are unlikely to get that information. Still, the customer may have shared some information that could inform your decision:
If no guidance exists, a good rule of thumb is to favor samples over sequences. The reason for this preference is that current research indicates that "low" sequence depths (10s - 1000s of sequences per sample) often capture all but very subtle variations, as shown in panels (a) and (b) of Figure 2 from Kuczynski et al., Direct sequencing of the human microbiome readily reveals community differences, Genome Biology, 2010:
These panels show "(a) The full dataset (approximately 1,500 sequences per sample); (b) the dataset sampled at only 10 sequences per sample, showing the same pattern".
Keep in mind that even with such constraints, some samples may be beyond use. For example, anything with fewer than 15 reads is unlikely to give any usable diversity information. Taking the sequencing depth down to numbers in the 10s or low 100s should be considered very cautiously: only gross differences will be detectable, and the vast majority of read information will be discarded.
Note that you may want to set the minimum sequencing depth to exactly the read count of the least-read-endowed sample you wish to include. For example, given the counts/sample detail shown above, picking a depth of 1000 would indeed lead sample 925.sg5x to be included in the samples analyzed downstream, but would use only 1000 of its 1383 sequences (and only 1000 of each higher-read sample's reads) in all future analyses. By comparison, selecting the maximum depth that still allows the inclusion of sample 925.sg5x, or 1383, ensures that as many reads as possible are used and thus the most statistical power is retained. However, some researchers may nonetheless prefer a nice round number, for easier mental comparison to other experiments.
Once you have decided on the sequencing depth to use, make a note of it. Also gather general information about how this choice affects the sample set (for example, "using a sequencing depth of 1383, 34 of 399 samples, or 8.5%, are removed from further processing") and include it in the report to the customer.
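That report sentence can be generated mechanically from the same per-sample counts; here is a small sketch using hypothetical toy counts (depth_report is an illustrative helper, not a QIIME function).

```python
# Sketch: summarize how a chosen rarefaction depth affects the sample set,
# in the phrasing suggested above. The counts dict is hypothetical toy
# data, not the real 399-sample table.
def depth_report(counts, depth):
    """Return a report-ready sentence about samples lost at this depth."""
    dropped = sum(1 for c in counts.values() if c < depth)
    total = len(counts)
    pct = 100.0 * dropped / total
    return ("using a sequencing depth of %d, %d of %d samples, or %.1f%%, "
            "are removed from further processing" % (depth, dropped, total, pct))

counts = {"s1": 0.0, "s2": 500.0, "s3": 1383.0, "s4": 2000.0}
print(depth_report(counts, 1383))
```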